Cabify Analysis

A simple analysis of how Cabify's customers are using the service.

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import datetime
import os
sns.set()
import colorlover as cl
import copy
from IPython.display import Image

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly as py
import plotly.graph_objs as go
import plotly.figure_factory as ff
init_notebook_mode(connected=True)

Let's first load and visualise the data so we know what we are dealing with.

In [4]:
loc = os.getcwd()
df_prueba_selec = pd.read_csv(loc+'/prueba_seleccion.csv',dtype='str')
df_prueba_selec.fillna(method='ffill', inplace=True) # rather than just removing the NaN-values, let's keep them; 
                                                     # filling out the NaN-values with the previous value. (Failing in
                                                     # doing so will remove almost the entire dataset.)
In [5]:
pd.set_option('display.max_columns', 100)
print("Total amount of rows: {0}".format(df_prueba_selec.shape[0]))
print("Total amount of columns: {0}".format(df_prueba_selec.shape[1]))
df_prueba_selec.head(5)
Total amount of rows: 52645
Total amount of columns: 50
Out[5]:
journey_id created_at agency_id region_id client_id user_id payer_id driver_id rider_id taxi_id vehicle_type_id company_id start_type start_city start_at start_lat start_lon end_city end_at end_lat end_lon end_state hired_at arrived_at pick_up_at drop_off_at currency price price_distance price_duration distance duration moving cost cost_distance cost_duration source rider_waiting_time driver_waiting_time user_agent price_base cost_base cost_payable price_supplements cost_supplements loyalty_program_id discount start_zone_id user_karma product_id
0 3b5ecd53ea384e6dabc8f3861636e9ff 2017-12-17 00:20:33 154ec8972649d39fbad0abc2beca9fa8 madrid eff0f1083b92e22fed44c6529cadc3b4 eff0f1083b92e22fed44c6529cadc042 eff0f1083b92e22fed44c6529cadc042 8e642b8c6fdeac657c8239ee519af99e eff0f1083b92e22fed44c6529cadc042 5ab28d8422b1384ed410faaf8fa1958b 02b2d1d74090fe3d26102ace8c6dadcb 9e910b1fb98a79b76d1337688d05338a asap Madrid 2017-12-16 23:41:20 40.4232533 -3.7088242 Madrid 2017-12-17 00:20:32 40.4491499668 -3.6590950191 drop off 2017-12-16 23:41:37 2017-12-16 23:53:37 2017-12-16 23:54:26 2017-12-17 00:20:32 EUR 1,754 1,254 0 8,043 1,615 7,004 1,386 991 0 iPhone 719 49 Cabify/6.3.6 iPhone10,6 11.2.1 1,254 991 1,386 500 395 b201e810e039ba331250f7d9a95f6fcd 0 07c4a23b5636b35e9ab1095e8dae4466 8 773e238306f33909b9ee8e0c54503fa4
1 78456c6023f14e629a116e1764e938a8 2017-11-24 06:39:45 154ec8972649d39fbad0abc2beca9fa8 madrid eff0f1083b92e22fed44c6529cadc3b4 eff0f1083b92e22fed44c6529cadc042 eff0f1083b92e22fed44c6529cadc042 20b92a251a0fbc91b7bd9d2df2c9ed8e eff0f1083b92e22fed44c6529cadc042 71351c3da7e728ac0acc279a7cb8d31d 02b2d1d74090fe3d26102ace8c6dadcb 71351c3da7e728ac0acc279a7cb98e37 asap Madrid 2017-11-24 06:39:44 40.4491499668 -3.6590950191 Madrid 2017-11-24 07:06:56 40.42906 -3.7156291 drop off 2017-11-24 06:39:49 2017-11-24 06:46:08 2017-11-24 06:46:20 2017-11-24 07:06:58 EUR 1,179 1,179 0 7,001 1,237 7,148 955 955 0 iPhone 378 12 Cabify/6.3.7 iPhone10,6 11.1.1 1,179 955 955 0 0 b201e810e039ba331250f7d9a95f6fcd 0 07c4a23b5636b35e9ab1095e8dae4466 8 773e238306f33909b9ee8e0c54503fa4
2 aad8fe7a29774e0eb03c93e7271d47a7 2017-11-29 20:57:36 154ec8972649d39fbad0abc2beca9fa8 madrid eff0f1083b92e22fed44c6529cadc3b4 eff0f1083b92e22fed44c6529cadc042 eff0f1083b92e22fed44c6529cadc042 0f68d790ea419810b50397c0e13cf71f eff0f1083b92e22fed44c6529cadc042 efe88d21f21b0f7c81887bdc0c195bdc 02b2d1d74090fe3d26102ace8c6dadcb b29c0699151cb2dd3e6922f9f84c7cd3 reserved Madrid 2017-11-30 06:30:00 40.4491499668 -3.6590950191 Madrid 2017-11-30 06:54:07 40.42906 -3.7156291 drop off 2017-11-30 06:15:12 2017-11-30 06:18:52 2017-11-30 06:29:48 2017-11-30 06:54:10 EUR 1,500 1,180 0 7,255 1,443 7,149 1,215 956 0 iPhone 220 655 Cabify/6.3.7 iPhone10,6 11.1.2 1,500 1,215 1,215 0 0 b201e810e039ba331250f7d9a95f6fcd 0 07c4a23b5636b35e9ab1095e8dae4466 8 773e238306f33909b9ee8e0c54503fa4
3 64bc20e94aa444e4844551cc83605c21 2017-11-10 20:43:56 154ec8972649d39fbad0abc2beca9fa8 madrid eff0f1083b92e22fed44c6529cadc3b4 eff0f1083b92e22fed44c6529cadc042 eff0f1083b92e22fed44c6529cadc042 a2d665864791e5f611d3984b0ee67ba1 eff0f1083b92e22fed44c6529cadc042 3b4062bb5c27fe38997a4f02f538240e 02b2d1d74090fe3d26102ace8c6dadcb 31c988956d6406b3ea2aed7217deb283 asap Madrid 2017-11-10 20:43:55 40.4531004248 -3.6604817212 Madrid 2017-11-10 20:59:21 40.4327842 -3.6825787 drop off 2017-11-10 20:44:02 2017-11-10 20:48:24 2017-11-10 20:48:28 2017-11-10 20:59:23 EUR 600 598 0 5,142 650 3,622 486 484 0 iPhone 261 4 Cabify/6.3.5 iPhone8,1 11.0.3 600 486 486 0 0 b201e810e039ba331250f7d9a95f6fcd 0 07c4a23b5636b35e9ab1095e8dae4466 7 773e238306f33909b9ee8e0c54503fa4
4 249c56c03bef4c4cb988406f7314bb8c 2017-12-14 22:02:20 154ec8972649d39fbad0abc2beca9fa8 madrid eff0f1083b92e22fed44c6529cadc3b4 eff0f1083b92e22fed44c6529cadc042 eff0f1083b92e22fed44c6529cadc042 080270094892f954ef62374a2522e646 eff0f1083b92e22fed44c6529cadc042 9edc168196870b7f5c5f7e792f549b76 02b2d1d74090fe3d26102ace8c6dadcb 3042628f23e2b4527cfc7c2cfaec82ff asap Madrid 2017-12-14 21:41:58 40.4288782 -3.6861515 Madrid 2017-12-14 22:02:18 40.426929 -3.699822 drop off 2017-12-14 21:42:05 2017-12-14 21:52:21 2017-12-14 21:52:36 2017-12-14 22:02:18 EUR 550 314 0 1,524 597 1,571 434 248 0 iPhone 615 15 Cabify/6.3.6 iPhone10,6 11.2 550 434 434 0 0 b201e810e039ba331250f7d9a95f6fcd 0 07c4a23b5636b35e9ab1095e8dae4466 8 773e238306f33909b9ee8e0c54503fa4

Prepare the data to be plotted.

In [6]:
pd.set_option('display.max_rows', 100)

# Load the price, cost and distance for each trip
price = df_prueba_selec.loc[:, "price"].str.replace(',', '')
cost = df_prueba_selec.loc[:, "cost"].str.replace(',', '')
distance = df_prueba_selec.loc[:, "distance"].str.replace(',', '')

# Remove NaN values
price.fillna(0, inplace=True)
cost.fillna(0, inplace=True)
distance.fillna(0, inplace=True)
distance = pd.DataFrame(distance.astype("int"), columns=["distance"])

# Calculate the profit for each ride. Divide with 100 to get it in €.
profit = pd.DataFrame((price.astype("int")-cost.astype("int"))/100, columns=['profit'])

# Add rider_id to the Dataframe
riders = pd.DataFrame(df_prueba_selec.loc[:, "rider_id"])
cleaned_data = pd.concat([riders, profit, distance], axis=1)

Number of trips per rider

In [23]:
riders = pd.value_counts(df_prueba_selec.loc[:, "rider_id"])

data1 = go.Bar(x=riders[:226].index,
               y=riders[:226].values,
               marker=dict(color='rgba(60, 179, 113,1)'),
               name="Top 25%")

data2 = go.Bar(x=riders[226:-226].index,
               y=riders[226:-226].values,
               marker=dict(color='rgba(181,126,220,1)'),
               name="Happy Medium")

data3 = go.Bar(x=riders[-226:].index,
               y=riders[-226:].values,
               marker=dict(color='rgba(255, 165, 0, 1)'),
               name="Bottom 25%")

layout = go.Layout(
        title='Number of trips per rider. <br>(Hover for rider ID. Pan and Zoom using the tools above to the right)',
        yaxis=dict(title='Number of Trips'),
        xaxis=dict(title = 'Rider ID', showticklabels=False),
        autosize=False,
        width=900,
        height=500,
        margin=go.layout.Margin(l=120, r=120, b=100, t=100, pad=4))

fig = go.Figure(data=[data1, data2, data3], layout=layout)
py.offline.iplot(fig)


# Calculate some statistics
total_rides = riders.values.sum()
top_25 = riders.values[:226].sum()      # Sum the total rides for the top 25%
bottom_25 = riders.values[-226:].sum()  # Sum the total rides for the bottom 25%
print("Top 25% of the customers stands for {0:0.0f}% of the total rides.".format((top_25/total_rides)*100))
print("Bottom 25% of the customers stands for {0:0.1f}% of the total rides.".format((bottom_25/total_rides)*100))
Top 25% of the customers stands for 75% of the total rides.
Bottom 25% of the customers stands for 1.4% of the total rides.

Some few riders make a huge amount of total rides. As a matter of fact, 25% of the top riders make 74% of all the rides. It would be great for Cabify to build some special customer loyalty to these top 25% in order to keep them using the service.

On the other hand, focusing the effort on re-attracting the bottom 25% to start using the service again or to use it more through discounts or "free rides" might be a good way to increse these customers' use frequency. As of today, they only contribute with 1.2% of the total rides. They already know about the Cabify brand, thus, it might be easier/cheaper to convince them to use the service again, or with a higher frequency, than convincing completely new customers.

Display the 1000 longest trips.

It might be interesting to know how long the longest rides have been.

In [9]:
# Calculate the longest trips.
longest_trips = copy.deepcopy(cleaned_data)
longest_trips.sort_values(by=["distance"], inplace=True, ascending=False)

data1 = go.Bar(x=longest_trips.iloc[:47, 0].values,
               y=longest_trips.iloc[:47, 2].values/1000,
               marker=dict(color='rgba(60, 179, 113,1)'),
               name="Above 50km")  # up to position 47 included rides longer than 50km

data2 = go.Bar(x=longest_trips.iloc[47:1000, 0].values,
               y=longest_trips.iloc[47:1000, 2].values/1000,
               marker=dict(color='rgba(181,126,220,1)'),
               name="Below 50km")

layout = go.Layout(
    title='Top 1000 longest trips (out of 52,645 total trips)',
    yaxis=dict(title='Distance (km)'),
    xaxis=dict(title = 'Rider Id', showticklabels=False),
    autosize=False,
    width=1000,
    height=500,
    margin=go.layout.Margin(l=120, r=120, b=100, t=100, pad=4),
    legend=dict(x=0.85, y=.93))


fig = go.Figure(data=[data1, data2], layout=layout)
py.offline.iplot(fig)

From the plot above, we can see that there are actually only two trips that are longer than 100km. One is as long as 614km! The vast majority are shorter than 50km though. Note that many of the riders have made several trips (resulting in both green and purple piles to the left).

In [10]:
# Calculate the profit per rider and display the most profitable ones
cleaned_data_per_rider = cleaned_data.groupby(['rider_id'])['profit', 'distance'].sum()
cleaned_data_per_rider.sort_values(by=["profit"], inplace=True, ascending=False)

data1 = go.Bar(x=cleaned_data_per_rider.iloc[:181].index,
               y=cleaned_data_per_rider.iloc[:181, 0].values,
               marker=dict(color='rgba(60, 179, 113, 1)'),
               name="Top 20% most profitable riders")

data2 = go.Bar(x=cleaned_data_per_rider.iloc[181:-181].index,
               y=cleaned_data_per_rider.iloc[181:-181, 0].values,
               marker=dict(color='rgba(181,126,220,1)'),
               name="Potentially top 20%")

data3 = go.Bar(x=cleaned_data_per_rider.iloc[-181:].index,
               y=cleaned_data_per_rider.iloc[-181:, 0].values,
               marker=dict(color='rgba(255, 165, 0, 1)'),
               name="Bottom 20%")

layout = go.Layout(
            title='Riders and profitability',
            yaxis=dict(title='Profits'),
            xaxis=dict(title = 'Rider (Pan for ID)', showticklabels=False),
            width=1000,
            height=600,
            margin=go.layout.Margin(l=120, r=120, b=100, t=100, pad=5),
            legend=dict(x=0.7, y=.95))

# Plot it
fig = go.Figure(data=[data1, data2, data3], layout=layout)
py.offline.iplot(fig)

# Calculate some statistics
total_profit = cleaned_data_per_rider.loc[:, "profit"].sum()
top_20 = cleaned_data_per_rider.iloc[:181, 0].sum()      # Top 20%
bottom_20 = cleaned_data_per_rider.iloc[-181:, 0].sum()  # Bottom 20%
print("Top 20% of the customers stands for {0:0.0f}% of the total profit.".format((top_20/total_profit)*100))
print("Bottom 20% of the customers stands for {0:0.1f}% of the total profit.".format((bottom_20/total_profit)*100))
Top 20% of the customers stands for 75% of the total profit.
Bottom 20% of the customers stands for 0.4% of the total profit.

Assuming that "cost" is the total cost for Cabify to provide the service to the customer and "price" is the total price the customer pays for this service, above figure displays the profitability for each rider with the top 20 percent displayed in green (pan the figure for more info on each rider). Profit is calculated as the difference between price and cost.

As Cabify wants to keep its most profitable customers, it migth be a good idea to offer them special programs, discounts or other advantages to keep them as customers. Targeting these 20% might thus be a good idea. Interesting to note is that 75% of the profit originates from only 20% of the customers. This is in close accordance with the pareto principle, or "80/20 rule".

Subsequently, the bottom 20% might instead be targeted with campaigns with the objective to increase their profitability. As of now, they only contribute with 0.4% of the total profit. Low profitable customers are surely better than no customers at all, but Cabify would do good in understanding why these customers aren't more profitable. Maybe they had a bad first experience, maybe they can't find a ride when they need it, the price, or maybe there are other reasons? Understanding these customers' needs might make them more profitable.

Another way to choose which customers to focus a customer loyalty program on.

In [11]:
rest = copy.deepcopy(cleaned_data_per_rider)

# Get the top riders in terms of profit and distance and remove the equivalent data from the original dataframe
top_profit = rest[rest.loc[:, "profit"] >= 1000]
rest.drop(rest[rest.profit >= 1000].index, inplace=True)

top_distance = rest[rest.loc[:, "distance"] >= 4000000]
rest.drop(rest[rest.distance >= 4000000].index, inplace=True)

# Concatenate the top riders
special_treatment = pd.concat([top_profit, top_distance], axis=0)
In [12]:
x1 = rest.loc[:, "distance"].values/1000
y1 = rest.loc[:, "profit"].values
x2 = special_treatment.loc[:, "distance"].values/1000
y2 = special_treatment.loc[:, "profit"].values

data1 = go.Scatter(x = x1,
                   y = y1,
                   mode = 'markers',
                   marker=dict(color='rgba(181,126,220,1)'),
                   name="Great clients")

data2 = go.Scatter(x = x2,
                   y = y2,
                   mode = 'markers',
                   marker=dict(color='rgba(60, 179, 113, 1)'),
                   name="Excellent clients")

layout = go.Layout(title='Profit vs Distance',
                   yaxis=dict(title='Profit (€)'),
                   xaxis=dict(title = 'Distance (Km)'),
                   autosize=False,
                   width=900,
                   height=600,
                   legend=dict(x=0.8, y=.95))

fig = go.Figure(data=[data1, data2], layout=layout)
py.offline.iplot(fig)

Above figure shows us a linear relationship (a part from some exceptions) between the total distance each client has made with Cabify and their profitability. This means that the more distance a client has made, the more profitable they are as well. One way to reward clients might be to reward those that have made a total distance of more than 4000 km or earned a profit for Cabify of more than 1000 € (both displayed in green). Some clients are highly profitable (see top left of the graph) as they have contributed to high profits during relatively short distances. These are Cabify's "golden" customers - maintaining them is thus a priority!

How do riders book their CABify? Focusing marketing campaigns.

In [13]:
# Find out HOW people make their Cabify reservations
source = pd.value_counts(df_prueba_selec.loc[:, "source"])

data = [go.Bar(x=source.index,
               y=source.values,
               marker=dict(
               color='rgba(181,126,220,1)'))]

layout = go.Layout(title='How do riders book their CABify?',
                   yaxis=dict(title='Amount'),
                   xaxis=dict(title = 'Source'),
                   width=700,
                   height=400)

fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig)

Combining above information, where we see that Android and iPhone are the most used sources for booking a ride, with the statistics on this site, it might be an interesting idea to cooperate with Huawei and Xiaomi (Android) as their market share in Spain is increasing while Samsung's is decreasing. A close cooperation with them might allow more potential customers to learn about Cabify!

How many Cabs, at What time and Where?

Let's start out by finding out which region is the most frequent one for Cabify rides:

In [14]:
region = pd.value_counts(df_prueba_selec.loc[:, "region_id"])

data = [go.Bar(x=region.index,
               y=region.values,
               marker=dict(
               color='rgba(181,126,220,1)'))]

layout = go.Layout(title='Which region is the most frequent one?',
                   yaxis=dict(title='Frequency'),
                   xaxis=dict(title = 'Region'),
                   width=700,
                   height=400)

fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig)

As Madrid is the by far most frequent region (around 80% of the total rides), we will focus our attention here.

It migth also be interesting to know which days are the most frequent ones. First we have to add the corresponding day for each date to the original data file. This is done with the datetime module in python.

Madrid region:

In [15]:
# Use the datetime module to find out which day each date corresponds to in the dataset.
days = []
for i in range(len(df_prueba_selec)):
    day = datetime.datetime.strptime(str(df_prueba_selec.iloc[i, 22]), '%Y-%m-%d %H:%M:%S').strftime('%A')
    days.append(day)
In [16]:
# Add the corresponding days to the dataframe
df_prueba_selec['weekday'] = pd.DataFrame(days, index=df_prueba_selec.index)
df_prueba_selec.to_csv(loc+'/prueba_seleccion_added.csv')  # store it locally
(df_prueba_selec).head(1)
Out[16]:
journey_id created_at agency_id region_id client_id user_id payer_id driver_id rider_id taxi_id vehicle_type_id company_id start_type start_city start_at start_lat start_lon end_city end_at end_lat end_lon end_state hired_at arrived_at pick_up_at drop_off_at currency price price_distance price_duration distance duration moving cost cost_distance cost_duration source rider_waiting_time driver_waiting_time user_agent price_base cost_base cost_payable price_supplements cost_supplements loyalty_program_id discount start_zone_id user_karma product_id weekday
0 3b5ecd53ea384e6dabc8f3861636e9ff 2017-12-17 00:20:33 154ec8972649d39fbad0abc2beca9fa8 madrid eff0f1083b92e22fed44c6529cadc3b4 eff0f1083b92e22fed44c6529cadc042 eff0f1083b92e22fed44c6529cadc042 8e642b8c6fdeac657c8239ee519af99e eff0f1083b92e22fed44c6529cadc042 5ab28d8422b1384ed410faaf8fa1958b 02b2d1d74090fe3d26102ace8c6dadcb 9e910b1fb98a79b76d1337688d05338a asap Madrid 2017-12-16 23:41:20 40.4232533 -3.7088242 Madrid 2017-12-17 00:20:32 40.4491499668 -3.6590950191 drop off 2017-12-16 23:41:37 2017-12-16 23:53:37 2017-12-16 23:54:26 2017-12-17 00:20:32 EUR 1,754 1,254 0 8,043 1,615 7,004 1,386 991 0 iPhone 719 49 Cabify/6.3.6 iPhone10,6 11.2.1 1,254 991 1,386 500 395 b201e810e039ba331250f7d9a95f6fcd 0 07c4a23b5636b35e9ab1095e8dae4466 8 773e238306f33909b9ee8e0c54503fa4 Saturday
In [18]:
df_prueba_selec = df_prueba_selec.loc[df_prueba_selec["region_id"] == "madrid"]  # keep the Madrid region only
day = pd.value_counts(df_prueba_selec.loc[:, "weekday"])
day = day.loc[["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]] # custom sort

data1 = [go.Bar(x=day.index,
               y=day.values,
               marker=dict(
               color='rgba(181,126,220,1)'))]

layout2 = go.Layout(title='Which is the most frequent Cabify day in the Madrid area?',
                   yaxis=dict(title='Frequency'),
                   xaxis=dict(title = 'Day'),
                   width=700,
                   height=400)

fig1 = go.Figure(data=data1, layout=layout2)
py.offline.iplot(fig1)

# ***************************** 

start_city = pd.value_counts(df_prueba_selec.loc[:, "start_city"])

data2 = [go.Bar(x=start_city.iloc[:10].index,
               y=start_city.iloc[:10].values,
               marker=dict(
               color='rgba(181,126,220,1)'))]

layout2 = go.Layout(title='And where? (10 most frequent)',
                   yaxis=dict(title='Frequency'),
                   xaxis=dict(title = 'Region'),
                   width=700,
                   height=400,
                   margin=go.layout.Margin(l=50, r=75, b=100, t=50, pad=5))

fig2 = go.Figure(data=data2, layout=layout2)
py.offline.iplot(fig2)

print()
print()
print("Total number of rides in the Madrid area: {0}".format(len(df_prueba_selec)))

Total number of rides in the Madrid area: 41698

As we can se from the above plot, Thursday offers the most rides during the week. Surprisingly, the number of rides during the weekends are just a fraction of that of a weekday.

The majority of the rides start in Madrid (68%), with a significant amount starting in Boadilla del Monte as well (26%) (see above). The below picture displays all the start locations in the Madrid area, plotted in Tableau. Follow this link for access to the workbook. We can chose to focus our attention on several areas, but for the sake of this exercise let us concentrate on two main areas; Boadilla del Monte and around Nuevos Ministerios, both displayed in green. Let's see at what time Cabify should focus their fleet on these areas!

In [261]:
Image(filename = loc + "/Madrid_locations.png")
Out[261]:

Boadilla del Monte

In [19]:
B_del_Monte = copy.deepcopy(df_prueba_selec.loc[df_prueba_selec["start_city"] == "Boadilla del Monte"])

# Convert date and time to a single hour.
hours = []
for i in range(len(B_del_Monte)):
    hour = datetime.datetime.strptime(str(B_del_Monte.iloc[i, 22]), '%Y-%m-%d %H:%M:%S').strftime('%H') # get hour
    hours.append(hour)
    
# Add the corresponding hour to the dataframe
B_del_Monte['hour'] = pd.DataFrame(hours, index=B_del_Monte.index)
In [20]:
hours = pd.value_counts(B_del_Monte.loc[:, "hour"])

data = go.Bar(x = hours.index,
                  y = hours.values,
                  marker=dict(color='rgba(181,126,220,1)'))


layout = go.Layout(title='Boadilla del Monte; Occurrences per hour.',
                   yaxis=dict(title='#Occurrences'),
                   xaxis=dict(title = 'Hour'),
                   autosize=False,
                   width=900,
                   height=600)

fig = go.Figure(data=[data], layout=layout)
py.offline.iplot(fig)

As we can see from the above plot, there's a big peak in rider activity between around 17.00 and 20.00. Increasing the number of CABifys during this time in Boadilla de Monte would thus be a good idea. Further investigations can be made to learn exactly which weekdays are the most frequent ones, but as we saw earlier, Monday to Friday are the by far most frequent ones. These hours also makes sense since it's during this time people wants to get back home after finishing work.

Nuevos Ministerios

As the dataset doesn't provide us with a tag "Nuevos Ministerios", we have to find other ways to select the data of interest. One way to do this is to display all the datapoints in Tableau, select those around Nuevos Ministerios and save the file before processing it the same way as we did for Boadilla de Monte. Let's do that! The corresponding Tableau workbook can be reached through this link.

In [21]:
# Read the file obtained from the Tableau workbook.
loc = os.getcwd()
df_Nuevos_Ministerios = pd.read_csv(loc+'/Nuevos_Ministerios.csv', delimiter=";")
df_Nuevos_Ministerios.fillna(method='ffill', inplace=True)
print("Total rides from Nuevos Ministerios: {0}".format(len(df_Nuevos_Ministerios)))
df_Nuevos_Ministerios.head(1)
Total rides from Nuevos Ministerios: 3599
Out[21]:
Hour of Pick Up At In / Out of Areas of Focus Journey Id Start City Weekday Start Lat Start Lon
0 13 In fff5e700c5a7b1b91feccc2c3136fad1 Madrid Wednesday 40,4521419 -3,6903855
In [22]:
hours1 = pd.value_counts(df_Nuevos_Ministerios.loc[:, "Hour of Pick Up At"])

data = go.Bar(x = hours1.index,
                   y = hours1.values,
                   marker=dict(color='rgba(181,126,220,1)'))

layout = go.Layout(title='Nuevos Ministerios; Occurrences per hour.',
                   yaxis=dict(title='#Occurrences'),
                   xaxis=dict(title = 'Hour'),
                   autosize=False,
                   width=900,
                   height=600)

fig = go.Figure(data=[data], layout=layout)
py.offline.iplot(fig)

The rides from Nuevos Ministerios are more evenly spread across the day. However, there is a peak during the morning rush hours between 06.00 and 09.00. Increasing the fleet size in this area at this time would thus be a good idea. Keeping roughly half of the fleet until 22.00 to cover the demand would further be advisable.

Other interesting aspects to investigate:

  • Is it iPhone or Android who stands for the largest profit? If, say iPhone users stands for a substantial larger share, it might be profitable to target iPhone users (who are not yet customers) in Cabify's markets to attract them as new customers.
  • Are there any kind of Vehicle Type Ids that are more common among Cabifys' customers? I. e. do customers prefer some vehicle types over others? If so, increase the fleet with those vehicle types.
  • Are some drivers more profitable than others? (might be controversial)
In [ ]:
 
In [ ]: